It's not just for stats: Mining the web with R

David Springate
May 2014

Summary

  1. What is web mining?
  2. Why use R for web mining?
  3. Web mining toolkit
  4. Example - Downloading multiple files
  5. Case study - Web mining for a new job
  6. Further info

What is web mining?

“Using data mining techniques to discover patterns from the web”

Getting hold of scientific data:

  • Using web APIs (Pubmed, Genbank, Google Geocoder etc.)
  • Screen scraping (traversing and processing data from web pages)
  • Downloading files from the web (HTML, XML, csv etc.)

Why do web mining?


“Let me Google that for you…”


Why use R to do web mining?

Web mining is:

  • Text processing
    • Tree/list processing to parse webpages
    • Regular expressions to clean up and filter data
  • Sending HTTP requests to get webpages
  • Aggregation to combine multiple datasets
  • R has either good basic functionality or good libraries to do all of this

  • Lower cognitive load?


Web mining toolkit

Text processing

  • base : paste, strsplit, sprintf etc.
  • XML : Working with XML and HTML - readHTMLTable, htmlTreeParse, xpathSApply
  • stringr : Sane regular expressions - str_detect, str_replace_all, str_match_all, …
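A minimal sketch of these text-processing tools together (all the strings here are made up):

```r
# base: split a csv-ish string and rebuild a URL from its parts
parts <- strsplit("2014-05-01,emr_study,42", ",")[[1]]
url   <- sprintf("http://example.com/data/%s.csv", parts[2])

# stringr: sane regular expressions for extracting IDs
library(stringr)
text <- "see PMID:123 and PMID:456"
ids  <- str_match_all(text, "PMID:(\\d+)")[[1]][, 2]
```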

Web requests

  • RCurl : Get web pages and download files - getURL, curlPerform, CFILE
  • httr : Dealing with APIs

Misc

  • rjson : Convert JSON documents to R objects - toJSON, fromJSON
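For example, a round trip through rjson (assuming the package is installed):

```r
library(rjson)

# a nested R structure, as you might build from scraped data
record <- list(id = 101, title = "EMR study", tags = c("health", "records"))

json <- toJSON(record)   # R list -> JSON string
back <- fromJSON(json)   # JSON string -> R list
```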

Example: Downloading multiple files

[Image: clinicalcodes]

Example: Downloading multiple files

  1. Download page and translate to R
  2. Scrape out all the links on the page
  3. Filter to those which are for downloading data
  4. Download links
# load the toolkit
require(RCurl)
require(XML)

# define your webpage
base_url <- "www.mysite.com"
address <- paste0(base_url, "/data")

# Download the page
page <- getURL(address)

# Convert to an R tree structure
tree <- htmlParse(page)

Example: Downloading multiple files

  1. Download page and translate to R
  2. Scrape out all the links on the page
  3. Filter to those which are for downloading data
  4. Download links

All HTML elements on the page with:

<a href="/path/">link</a>

## Get all link elements
links <- xpathSApply(tree,        
             path = "//*/a", 
             fun = xmlGetAttr, 
             name = "href")

## Convert to vector
links <- unlist(links)

Example: Downloading multiple files

  1. Download page and translate to R
  2. Scrape out all the links on the page
  3. Filter to those which are for downloading data
  4. Download links

Filter all links containing the word “download”

## Use a vectorised regular
## expression to filter
links <- links[str_detect(links, "download")]

Example: Downloading multiple files

  1. Download page and translate to R
  2. Scrape out all the links on the page
  3. Filter to those which are for downloading data
  4. Download links
for(link in links){
    dat <- getURL(link)  # assumes absolute URLs
    f <- read.csv(text = dat)
    write.csv(f, file = basename(link), row.names = FALSE)
}

Working with web APIs

  • A set of HTTP requests with structured responses
    • XML
    • JSON

Methods

  1. Construct a web query (paste etc.)
  2. Get the webpage
  3. Parse the XML or JSON
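A hedged sketch of the three steps against a hypothetical JSON API (the endpoint, parameters and response fields are invented):

```r
library(RCurl)
library(rjson)

# 1. Construct the query (utils::URLencode escapes the search term)
build_query <- function(term, limit = 10) {
    paste0("http://api.example.org/papers?term=",
           URLencode(term, reserved = TRUE), "&limit=", limit)
}

# 2. Get the page and 3. parse the JSON (needs network access)
fetch_papers <- function(term) {
    fromJSON(getURL(build_query(term)))
}
```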

A huge number of science web APIs have already been wrapped in R packages and are generally easy to use:

  • Bioconductor
  • rOpenSci
  • rOpenHealth
  • rOpenGov

The httr package is great for building R wrappers around web APIs!

Case study: Web mining for a new job


I need a new job!

But where to send my CV?

  • Universities with groups doing electronic medical records research
  • Publishing good quality research in this field
  • Highly ranked institutions
  • In Europe

Case study: Web mining for a new job

  1. Find articles in my field on Pubmed
  2. Download article data from Pubmed and extract:
    • journal name and correspondence address
  3. Get Impact factors for these journals
  4. Select articles from highly ranked European Universities
  5. Get coordinates for these universities
  6. Aggregate datasets and rank by IF, rank and publications
  7. VISUALISE…

Find articles in my field on Pubmed

rentrez: a wrapper around the NCBI Entrez service, used here to get the Pubmed IDs

Get the 5000 most recent papers containing the term “Electronic medical records” in the title or abstract:

require(rentrez)
paper_ids <- entrez_search(db = "pubmed", 
             term = "electronic medical records", 
             retmax = 5000)$ids

Returns a vector of pubmed IDs for the papers

Find articles in my field on Pubmed

rpubmed: tools for downloading and processing Pubmed data

Download the identified articles as JSON and convert to R data structures:

require(devtools)
install_github(
    "rOpenHealth/rpubmed")
require(rpubmed)

# Download articles with corresponding IDs 
records <- fetch_in_chunks(paper_ids)                 

Returns a large list (230 Mb) of records, each of which is a set of nested lists…

Find articles in my field on Pubmed

Then you can traverse the data using normal list processing tools:

## Create an address vector:
addresses <- as.character(sapply(records, 
                                 function(x) x$MedlineCitation$Article$AuthorList$Author$Affiliation))

## Create a journal vector 
journals <- as.character(sapply(records, 
                                function(x) x$MedlineCitation$MedlineJournalInfo$ISSNLinking))


Clean up with regex and combine to a dataframe…
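The sort of clean-up meant here might look like this (the affiliation string and email address are invented):

```r
library(stringr)

addr <- "Institute of Population Health, University of Manchester, UK. jobs@example.com"

# Pull out the correspondence email, if present
email <- str_match(addr, "[\\w.]+@[\\w.]+")[, 1]

# Strip the email to leave just the postal address
clean <- str_trim(str_replace(addr, "\\s*[\\w.]+@[\\w.]+\\s*$", ""))
```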

Get journal impact factors

XML::readHTMLTable() downloads all HTML tables from a webpage and stores them as a list of dataframes!

impact_url <- "http://www.citefactor.org/impact-factor-list-2012.html"

# HTMLtable -> df!
impacts <- readHTMLTable(impact_url)[[1]]

# Keep relevant columns
impacts <- impacts[, c(1,2,4)]


[Image: the impacts dataframe]

Get institution rankings

rankings_url <- "http://www.researchranking.org/?action=ranking" 

## No header in the table
rankings <- readHTMLTable(rankings_url, header = FALSE, stringsAsFactors = FALSE)[[1]]
names(rankings)  <-  c("rank", "institution", "type", "country", "score")

## clean up
rankings$institution <- str_replace(rankings$institution, "^THE ", "")

Merge impact factor data...

## From Pubmed:
jobs <- data.frame(address = addresses,                                      
                   ISSN = journals)
## Merge in impact factors
jobs <- merge(jobs, 
              impacts, 
              all.x = TRUE)
## remove null values
jobs <- jobs[complete.cases(jobs),]
jobs <- jobs[jobs$address != "NULL",]

Merge institution score data...

# Rankings are more complicated:
jobs$institution_score <- NA                                                                  
jobs$institutions <- NA
jobs$countries <- NA
## Loop over the ranked institutions:
for(institution in 1:nrow(rankings)){
    # Which article addresses match the institution?
    new_score <- str_detect(jobs$address, 
                            ignore.case(rankings[institution, 
                                                 "institution"]))
    if(any(new_score)){  
        jobs$institution_score[new_score] <- 
            rankings[institution, "score"]
        jobs$institutions[new_score] <- 
            rankings[institution, "institution"]
        jobs$countries[new_score] <- 
            rankings[institution, "country"]
    }
}
selected <- jobs[!is.na(jobs$institution_score),]

Geocoding

The ggmap package has a vectorised R interface to the Google geocoding API:

require(ggmap)

# returns a dataframe of coordinates
coords <- geocode(paste(selected$institutions,
                        selected$countries))

## bind to original data:
selected <- cbind(selected, coords)

[Image: longitude/latitude coordinates]

The final dataset

[Image: the final dataset]

Aggregated with reshape2, then ranked and sorted
by institution score, mean impact factor
and number of publications.
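On toy data, that aggregation step might be sketched like this (the column names and values are stand-ins for the real ones):

```r
library(reshape2)

selected <- data.frame(
    institutions      = c("Uni A", "Uni A", "Uni B"),
    institution_score = c(90, 90, 75),
    impact            = c(5.1, 3.1, 4.2)
)

# Mean impact factor per institution
mean_if <- dcast(selected, institutions + institution_score ~ .,
                 fun.aggregate = mean, value.var = "impact")
names(mean_if)[3] <- "mean_if"

# Number of publications per institution
n_pubs <- dcast(selected, institutions ~ .,
                fun.aggregate = length, value.var = "impact")
names(n_pubs)[2] <- "n_papers"

# Rank by score, then mean IF, then publication count
ranked <- merge(mean_if, n_pubs)
ranked <- ranked[order(-ranked$institution_score,
                       -ranked$mean_if, -ranked$n_papers), ]
```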

[Image: map of selected institutions]

The web is big... Really big...


  • Find the data you want and pare it down to manageable information
  • R has all the tools to do this efficiently at the small (researcher) scale
  • Functional programming paradigm is well suited to this
  • Generally not limited by processor speed

Helpful to know about...

  • Regular expressions
  • HTML
    • Browser tools
  • HTTP requests
    • APIs (GET, PUT, REST, JSON etc.)
  • XPath for traversing HTML/XML
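A tiny self-contained XPath example, using the XML package on an inline page:

```r
library(XML)

html <- '<html><body>
           <a href="/download/1">one</a>
           <a href="/about">about</a>
           <a href="/download/2">two</a>
         </body></html>'

tree  <- htmlParse(html, asText = TRUE)

# Select only the <a> nodes whose href mentions "download"
hrefs <- xpathSApply(tree, "//a[contains(@href, 'download')]",
                     xmlGetAttr, name = "href")
```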

Thank you!


@datajujitsu
https://gist.github.com/DASpringate/11253206

rOpenHealth: A collaborative project to build R tools facilitating access to quantitative healthcare and public health data

  • Aims to wrap all healthcare and public health APIs
  • Building tools to access healthcare data
  • Join us?

https://github.com/rOpenHealth/